Introduction

This R Markdown Document explores the relationships between Total Yards Per Game (YPG) and the Points Per Game (PPG) that a team scores in both the National Football League (NFL), as well as the Division 1 Football Bowl Subdivision (FBS). These results are then compared to answer the question of whether PPG and YPG have a differing relationship between the NFL and FBS. This project first examines the individual relationships between the NFL teams’ YPG and their PPG, as well as examining how passing yards per game (PYPG) and rushing yards per game (RYPG) affect the results. This process is then repeated for the FBS, and then the resulting data is used to conclude how similar these relationships are in the FBS versus the more competitive NFL.

NFl Data

NFL Visualizations

Below is a table that attempts to show a relationship between PPG and YPG in the NFL.

Every NFL team’s rank in PPG and YPG
Team YPG Rank PPG Rank
10 Chiefs 1 1
4 Bills 2 2
15 Eagles 3 3
13 Cowboys 11 4
20 Lions 4 5
1 49ers 5 6
3 Bengals 8 7
32 Vikings 7 8
28 Seahawks 13 9
18 Jaguars 10 10
14 Dolphins 6 11
24 Raiders 12 12
9 Chargers 9 13
21 Packers 17 14

In this table, the top 14 NFl teams are sorted by their rank (out of 32) in PPG in ascending order. Their rank in YPG is also shown. This table attempts to show that, in general, the teams ranked highly in PPG are also ranked highly in YPG. For example, the top three teams in PPG, the Kansas City Chiefs, Buffalo Bills, and Philadelphia Eagles respectively, are also the top three teams in YPG, in the same order. The fourth ranked team in PPG, the Dallas Cowboys is only 11th in YPG, so the pattern is not true for all 32 teams. However, of the top 10 teams in PPG, only two are outside the top 10 in YPG, the aforementioned Cowboys and the Seattle Seahawks. Those two teams are ranked 11th and 13th in YPG respectively. The fact that the top 10 teams in PPG are all top 13 in YPG is an excellent indicator that there may be a relationship between PPG and YPG. We will conduct further analysis to see whether there really is a relationship between PPG and YPG or not.

Going further into that exploration, below are three scatterplots. The first shows the relationship between PPG and YPG in the NFl. The second shows the relationship between PPG and PYPG iun the NFl. The third shows the relationship between PPG and RYPG in the NFL.

## Warning: Using the `size` aesthetic with geom_line was deprecated in ggplot2 3.4.0.
## ℹ Please use the `linewidth` aesthetic instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

All three scatterplots show a positive, linear relationship. The first two, PPG vs YPG and PPG vs PYPG respectively, show a very clear and obvious positive linear relationship. The third plot, PPG vs RYPG, shows a less clear and obvious positive linear relationship. There appears to be some correlation, but not a lot. Regardless, the most important thing is that y increases as x increases. This is true for all three plots, whether x is YPG, PYPG, or RYPG.

NFL Analysis

Simple Linear Regression

We decided to perform simple linear regression to see if PPG and YPG are related. In order to infer, some conditions must be met. The data must be linear, have equal variance, and must be normal. Below are two plots that asses whether or not the conditions are met.

## `geom_smooth()` using formula = 'y ~ x'

The first condition that must be met is linearity. The explanatory and response variables must have a linear relationship. This was proven by the first scatterplot of PPG and YPG. The linearity condition, therefore, is met. The second condition that must be met is equal variance. The first plot above is a scatterplot of residuals versus fitted values. The points in this plot must show an equal spread across the width. If they do, then the data have equal variance. Our residual vs fitted plot is fairly good. It is certainly not perfect, but nothing ever is in the real world. The plot shows enough spread and no obvious patterns, so the equal variance condition, therefore, is met. The third and final condition that must be met is normality. The data must be normally distributed. A Q-Q plot assesses this condition. If the Q-Q plot shows a strong linear relationship then the data are normal. Our QQ plot shows a relatively strong linear relationship. Like before, it is not perfect, but it is good enough. The normality condition, therefore, is met.

Because all the conditions for inference are met, we can now infer from the data. The PPG vs YPG model has a test statistic of \(9.842055\) which corresponds to a p-value of \(6.620226 × 10^{-11}\).

Multiple Linear Regression

We also decided to perform multiple linear regression to see how if PPG is related to PYPG and RYPG. The conditions for inference for multiple linear regression are the same as simple linear regression, with one additional condition. There must not be multicollinearity. Below are plots that asses the conditions for multiple linear regression.

## `geom_smooth()` using formula = 'y ~ x'

In multiple linear regression, the linearity condition is met if all predictors have a linear relationship. This linear relationship is seen in the second and third scatterplots from the visualization section. Those scatterplots prove that all predictors have a linear relationship, even if some are not that good. The linearity condition, therefore, is met. Unlike linearity, equal variance and normality are the same as before. The residual versus fitted plot shows a decent spread across the width, so the equal variance condition, therefore, is met. The Q-Q plot shows a weak linear relationship, but there still is one. We cannot say for sure if normality is met, but we will still proceed with caution as if it is met. Multicollinearity occurs if two independent variables have a high correlation. The two independent variables are the two explanatory variables, PYPG and RYPG. The third plot, a scatterplot that shows the relationship between PYPG and RYPG, should be a good indicator of if multicollinearity is present or not. If this scatterplot showed a strong relationship, it would be evidence that multicollinearity is present. It does not, however. Additionally, variance inflation factor (VIF) is a direct measure of multicollinearity. Generally, a value of five or above means multicollinearity is present. The VIF value for both predictors is \(1.421673\), so the multicollinearity condition, therefore, is met.

All conditions for inference are met. The multiple linear regression model has two predictors, and will therefore have two test statistics and p-values. For PYPG, the test statistic is \(9.814515\) which corresponds to a p-value of \(1.007533 × 10^{-10}\). For RYPG, the test statistic is \(7.282251\) which corresponds to a p-value of \(5.086612 × 10^{-8}\).

Regression Conclusion

All p-values are essentially zero. For the simple linear regression, this means that we have proven that PPG and YPG have a linear relationship. As YPG increases, so does PPG. For the multiple linear regression,this means that we can conclude that both predictors are statistically significant. Both RPYG and PYPG, therefore, significantly affect PPG.

College Data

Every college team’s rank in PPG and YPG
Team YPG Rank PPG Rank
103 Tennessee 1 1
80 Ohio State 9 2
116 USC 3 3
3 Alabama 11 4
35 Georgia 5 5
62 Michigan 24 6
125 Washington 2 7
114 UCLA 4 8
85 Oregon 6 9
101 TCU 27 10
117 Utah 17 11
94 SMU 14 12
45 James Madison 29 13
120 UTSA 12 14

## `geom_smooth()` using formula = 'y ~ x'

## 
## Call:
## lm(formula = `Points Per Game` ~ `Yards Per Game`, data = collegeOffense)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3760 -1.8875  0.3087  1.7450  5.6176 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -13.256685   1.528327  -8.674 1.54e-14 ***
## `Yards Per Game`   0.105643   0.003854  27.410  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.672 on 129 degrees of freedom
## Multiple R-squared:  0.8535, Adjusted R-squared:  0.8523 
## F-statistic: 751.3 on 1 and 129 DF,  p-value: < 2.2e-16
## `geom_smooth()` using formula = 'y ~ x'

## 
## Call:
## lm(formula = `Points Per Game` ~ `Passing Yards Per Game` + `Rushing Yards Per Game`, 
##     data = collegeOffense)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3263 -1.8484  0.1267  1.7477  5.6455 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              -13.262958   1.531178  -8.662 1.72e-14 ***
## `Passing Yards Per Game`   0.103847   0.004591  22.621  < 2e-16 ***
## `Rushing Yards Per Game`   0.108315   0.005343  20.272  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.677 on 128 degrees of freedom
## Multiple R-squared:  0.8541, Adjusted R-squared:  0.8518 
## F-statistic: 374.5 on 2 and 128 DF,  p-value: < 2.2e-16

## Analysis of Variance Table
## 
## Response: Points Per Game
##                   Df Sum Sq Mean Sq   F value Pr(>F)    
## `Yards Per Game`   1 6683.0  6683.0 1044.5739 <2e-16 ***
## Level              1   14.1    14.1    2.1975 0.1402    
## Residuals        160 1023.7     6.4                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: Points Per Game
##                           Df Sum Sq Mean Sq F value    Pr(>F)    
## `Passing Yards Per Game`   1 2910.9 2910.87 112.197 < 2.2e-16 ***
## Level                      1  658.8  658.78  25.392 1.252e-06 ***
## Residuals                160 4151.1   25.94                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: Points Per Game
##                           Df Sum Sq Mean Sq F value   Pr(>F)    
## `Rushing Yards Per Game`   1 2427.4 2427.42 77.1248 2.34e-15 ***
## Level                      1  257.5  257.51  8.1819 0.004796 ** 
## Residuals                160 5035.8   31.47                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Code Appendix

#load necessary packages
knitr::opts_chunk$set(echo = FALSE, dpi=300)
library("groundhog")
pkgs <- c('Stat2Data', 'tidyverse', 'mosaic', 'ggformula', 'Lock5Data', 'tinytex', 'car', 'leaps', 'HH', 'rvest', 'dplyr', 'readr', 'ggplot2', 'tidyr', 'readxl', 'esquisse', 'kableExtra', 'ggformula')
groundhog.library(pkgs, '2023-11-29') #Using the date when project was started
#create nfl table
nflTable <- read_html(
  x = "http://pfref.com/pi/share/d7riA"
) %>%
  html_elements(css = "table") %>%
  html_table()
nflOffense <- nflTable[[1]]
#delete unnecessary data
nflOffense <- subset(nflOffense, select=-c(1, 6:12, 14:18, 20:28))
nflOffense <- nflOffense[-c(1, 22:23, 36:38), ]

#rename columns 
nflOffense <- nflOffense%>%rename(
  `Team`=1, `Games`=2, `Total Points`=3, `Total Yards`=4, `Total Passing Yards`=5, `Total Rushing Yards`=6)

#squish team names
nflOffense$Team <- sub(".*\\s", "", nflOffense$Team)

#rearrange column order
#nflOffense <- nflOffense[,c("Team", "Games","Points","Yards")]

#make every cell a number versus a character
nflOffense$Games <- as.numeric(nflOffense$Games)
nflOffense$`Total Points` <- as.numeric(nflOffense$`Total Points`)
nflOffense$`Total Yards` <- as.numeric(nflOffense$`Total Yards`)
nflOffense$`Total Passing Yards` <- as.numeric(nflOffense$`Total Passing Yards`)
nflOffense$`Total Rushing Yards` <- as.numeric(nflOffense$`Total Rushing Yards`)

#make per game stats
nflOffense$`Points Per Game` <- nflOffense$`Total Points` / nflOffense$Games
nflOffense$`Yards Per Game` <- nflOffense$`Total Yards` / nflOffense$Games
nflOffense$`Passing Yards Per Game` <- nflOffense$`Total Passing Yards` / nflOffense$Games
nflOffense$`Rushing Yards Per Game` <- nflOffense$`Total Rushing Yards` / nflOffense$Games

#remove unnecessary data
nflOffense <- subset(nflOffense, select=-c(2:6))

#round down to tenths place
nflOffense$`Points Per Game` <- round(nflOffense$`Points Per Game`, digits = 1)
nflOffense$`Yards Per Game` <- round(nflOffense$`Yards Per Game`, digits = 1)
nflOffense$`Passing Yards Per Game` <- round(nflOffense$`Passing Yards Per Game`, digits = 1)
nflOffense$`Rushing Yards Per Game` <- round(nflOffense$`Rushing Yards Per Game`, digits = 1)

#rearrange columns
nflOffense <- nflOffense[, c(1, 2, 4, 5, 3)]
#create new dataframe with just points per game
nflOffensePPG <- subset(nflOffense, select=-c(3:5))
#order the rows by PPG 
nflOffensePPG <- nflOffensePPG[order(nflOffensePPG$`Points Per Game`, decreasing = TRUE),]

#create new dataframe with just yards per game
nflOffenseYPG <- subset(nflOffense, select=-c(2:4))
#order the rows by YPG
nflOffenseYPG <- nflOffenseYPG[order(nflOffenseYPG$`Yards Per Game`, decreasing = TRUE),]

nflOffensePPG <- nflOffensePPG%>%
  mutate(`PPG Rank`= c(1:32))
nflOffensePPG <- subset(nflOffensePPG, select=-c(2))

nflOffenseYPG <- nflOffenseYPG%>%
  mutate(`YPG Rank`= c(1:32))
nflOffenseYPG <- subset(nflOffenseYPG, select=-c(2))

nflRanks <- merge(nflOffenseYPG, nflOffensePPG, by = "Team", all = TRUE)
nflRanks <- nflRanks[order(nflRanks$`PPG`, decreasing = FALSE),]
nflRanks14 <- head(nflRanks, n=14)
nflRanks14%>%
  kable(
    caption="Every NFL team's rank in PPG and YPG",
    booktabs=TRUE,
    align=c("l", rep("c",6))
  )%>%
  kableExtra::kable_styling(
    bootstrap_options=c("striped", "condensed"),
    font_size=16
  )
gf_point(`Points Per Game`~`Yards Per Game`, 
         data=nflOffense, 
         linewidth=1, 
         title= 'Points Per Game vs Yards Per Game in the NFL')%>%
  gf_lm()

gf_point(`Points Per Game`~`Passing Yards Per Game`, 
         data=nflOffense, 
         linewidth=1,
         title= 'Points Per Game vs Passing Yards Per Game in the NFL')%>%
  gf_lm()

gf_point(`Points Per Game`~`Rushing Yards Per Game`, 
         data=nflOffense, 
         linewidth=1,
         title= 'Points Per Game vs Rushing Yards Per Game in the NFL')%>%
  gf_lm()
nflTotalYardsSLRModel <- lm(`Points Per Game`~`Yards Per Game`, data=nflOffense)

mplot(nflTotalYardsSLRModel, which=1)
mplot(nflTotalYardsSLRModel, which=2)

nflTotalYardsSLRSummary <- summary(nflTotalYardsSLRModel)
nflTotalYardsSLROutput <- nflTotalYardsSLRSummary[["coefficients"]]
nflMLRModel <- lm(`Points Per Game`~`Passing Yards Per Game`+`Rushing Yards Per Game`, data=nflOffense)

mplot(nflMLRModel, which=1)
mplot(nflMLRModel, which=2)
gf_point(`Passing Yards Per Game`~`Rushing Yards Per Game`, data=nflOffense)
vif <- vif(nflMLRModel)

nflMLRSummary<- summary(nflMLRModel)
nflMLROutput <- nflMLRSummary[["coefficients"]]
collegeTable <- read_html(
  x = "https://www.sports-reference.com/cfb/years/2022-team-offense.html"
) %>%
  html_elements(css = "table") %>%
  html_table()
collegeOffense <- collegeTable[[1]]

collegeOffense <- subset(collegeOffense, select=-c(1, 3, 5:7, 9:10, 12:14, 16:25))

collegeOffense <- collegeOffense%>%rename(
   `Team`=1, `Points Per Game`=2, `Passing Yards Per Game`=3, `Rushing Yards Per Game`=4, `Yards Per Game`=5
 )
collegeOffense <- collegeOffense[-c(1, 22:23, 44:45, 66:67, 88:89, 110:111, 132:133),]

collegeOffense$`Points Per Game` <- as.numeric(collegeOffense$`Points Per Game`)
collegeOffense$`Passing Yards Per Game` <- as.numeric(collegeOffense$`Passing Yards Per Game`)
collegeOffense$`Rushing Yards Per Game` <- as.numeric(collegeOffense$`Rushing Yards Per Game`)
collegeOffense$`Yards Per Game` <- as.numeric(collegeOffense$`Yards Per Game`)
#create new dataframe with just points per game
collegeOffensePPG <- subset(collegeOffense, select=-c(3:5))
#order the rows by PPG 
collegeOffensePPG <- collegeOffensePPG[order(collegeOffensePPG$`Points Per Game`, decreasing = TRUE),]

#create new dataframe with just yards per game
collegeOffenseYPG <- subset(collegeOffense, select=-c(2:4))
#order the rows by YPG
collegeOffenseYPG <- collegeOffenseYPG[order(collegeOffenseYPG$`Yards Per Game`, decreasing = TRUE),]

collegeOffensePPG <- collegeOffensePPG%>%
  mutate(`PPG Rank`= c(1:131))
collegeOffensePPG <- subset(collegeOffensePPG, select=-c(2))

collegeOffenseYPG <- collegeOffenseYPG%>%
  mutate(`YPG Rank`= c(1:131))
collegeOffenseYPG <- subset(collegeOffenseYPG, select=-c(2))

collegeRanks <- merge(collegeOffenseYPG, collegeOffensePPG, by = "Team", all = TRUE)
collegeRanks <- collegeRanks[order(collegeRanks$`PPG`, decreasing = FALSE),]
collegeRanks14 <- head(collegeRanks, n=14)

collegeRanks14%>%
  kable(
    caption="Every college team's rank in PPG and YPG",
    booktabs=TRUE,
    align=c("l", rep("c",6))
  )%>%
  kableExtra::kable_styling(
    bootstrap_options=c("striped", "condensed"),
    font_size=16,
  )

gf_point(`Points Per Game`~`Yards Per Game`, data=collegeOffense, linewidth=1)%>%gf_lm()
gf_point(`Points Per Game`~`Passing Yards Per Game`, data=collegeOffense, linewidth=1)%>%gf_lm()
gf_point(`Points Per Game`~`Rushing Yards Per Game`, data=collegeOffense, linewidth=1)%>%gf_lm()
collegeTotalYardsSLRModel <- lm(`Points Per Game`~`Yards Per Game`, data=collegeOffense)
collegeMLRModel <- lm(`Points Per Game`~`Passing Yards Per Game`+`Rushing Yards Per Game`, data=collegeOffense)

mplot(collegeTotalYardsSLRModel, which=1)
mplot(collegeTotalYardsSLRModel, which=2)
summary(collegeTotalYardsSLRModel)

mplot(collegeMLRModel, which=1)
mplot(collegeMLRModel, which=2)
summary(collegeMLRModel)
collegeOffense <- collegeOffense%>%
  mutate(`Level`= "College")
nflOffense <- nflOffense%>%
  mutate(`Level`= "NFL")
offense <- rbind(nflOffense,collegeOffense)
offense$IndLevel <- as.numeric(offense$Level=="NFL")

ggplot(offense) +
  aes(
    x = `Yards Per Game`,
    y = `Points Per Game`,
    color = Level
  ) +
  labs(
     title= "Points Per Game versus Yards Per Game by Level of Competition"
  )+
  geom_point(shape = "circle", size = 1.5) +
  scale_color_hue(direction = 1) +
  theme_minimal()

ggplot(offense) +
  aes(
    x = `Passing Yards Per Game`,
    y = `Points Per Game`,
    color = Level
  ) +
  labs(
     title= "Points Per Game versus Passing Yards Per Game by Level of Competition"
  )+
  geom_point(shape = "circle", size = 1.5) +
  scale_color_hue(direction = 1) +
  theme_minimal()

ggplot(offense) +
  aes(
    x = `Rushing Yards Per Game`,
    y = `Points Per Game`,
    color = Level
  ) +
  labs(
     title= "Points Per Game versus Rushing Yards Per Game by Level of Competition"
  )+
  geom_point(shape = "circle", size = 1.5) +
  scale_color_hue(direction = 1) +
  theme_minimal()


offenseModel <- lm(`Points Per Game` ~ `Yards Per Game`+IndLevel+`Yards Per Game`*IndLevel, data = offense)
b0 <- summary(offenseModel)$coeff[1,1]
b1 <- summary(offenseModel)$coeff[2,1]
b2 <- summary(offenseModel)$coeff[3,1]
b3 <- summary(offenseModel)$coeff[4,1]
gf_point(`Points Per Game` ~ `Yards Per Game`, color=~Level, data = offense) %>%
gf_abline(intercept = b0, slope = b1, color = "red") %>%
gf_abline(intercept = b0+b2, slope = b1+b3, color = "blue")

passingOffenseModel <- lm(`Points Per Game` ~ `Passing Yards Per Game`+IndLevel+`Passing Yards Per Game`*IndLevel, data = offense)
pb0 <- summary(passingOffenseModel)$coeff[1,1]
pb1 <- summary(passingOffenseModel)$coeff[2,1]
pb2 <- summary(passingOffenseModel)$coeff[3,1]
pb3 <- summary(passingOffenseModel)$coeff[4,1]
gf_point(`Points Per Game` ~ `Passing Yards Per Game`, color=~Level, data = offense) %>%
gf_abline(intercept = pb0, slope = pb1, color = "red") %>%
gf_abline(intercept = pb0+pb2, slope = pb1+pb3, color = "blue")

rushingOffenseModel <- lm(`Points Per Game` ~ `Rushing Yards Per Game`+IndLevel+`Rushing Yards Per Game`*IndLevel, data = offense)
rb0 <- summary(rushingOffenseModel)$coeff[1,1]
rb1 <- summary(rushingOffenseModel)$coeff[2,1]
rb2 <- summary(rushingOffenseModel)$coeff[3,1]
rb3 <- summary(rushingOffenseModel)$coeff[4,1]
gf_point(`Points Per Game` ~ `Rushing Yards Per Game`, color=~Level, data = offense) %>%
gf_abline(intercept = rb0, slope = rb1, color = "red") %>%
gf_abline(intercept = rb0+rb2, slope = rb1+rb3, color = "blue")
totalYardsANCOVAModel <- lm(`Points Per Game` ~ `Yards Per Game` + Level, data=offense)
anova(totalYardsANCOVAModel)

passingYardsANCOVAModel <- lm(`Points Per Game` ~ `Passing Yards Per Game` + Level, data=offense)
anova(passingYardsANCOVAModel)

rushingYardsANCOVAModel <- lm(`Points Per Game` ~ `Rushing Yards Per Game` + Level, data=offense)
anova(rushingYardsANCOVAModel)